Abstract

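Consider laying out a fixed-topology tree of N nodes into external memory with
block size B so as to minimize the worst-case number of block memory transfers
required to traverse a path from the root to a node of depth D. We prove that the
optimal number of memory transfers is

\[
\begin{cases}
\Theta\left(\dfrac{D}{\lg(B+1)}\right) & \text{when } D = O(\lg N), \\[1ex]
\Theta\left(\dfrac{\lg N}{\lg\left(2 + \frac{B \lg N}{D}\right)}\right) & \text{when } D = \Omega(\lg N) \text{ and } D = O(B \lg N), \\[1ex]
\Theta\left(\dfrac{D}{B}\right) & \text{when } D = \Omega(B \lg N).
\end{cases}
\]

This bound can be achieved even when B is unknown to the (cache-oblivious) layout
algorithm.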



1 Introduction 

Trees can have a meaningful topology in the sense that edges carry a specific meaning — 
such as letters from an alphabet in a suffix tree or trie — and consequently nodes cannot be 
freely rebalanced. Nontrivial trees also do not fit in the cache closest to the processor, so a 
natural problem is to lay out (store) a tree in a way that minimizes the cost of a root-to- 
node traversal in a multilevel memory hierarchy. Here we consider efficient algorithms for 
laying out a static fixed-topology binary tree in the external-memory and cache-oblivious 
memory-hierarchy models. 

The external-memory model [AV88] (or I/O model or Disk Access Model) defines a memory
hierarchy of two levels: one level is fast but has limited size, M, and the other level is
slow but has unlimited size. Data can be transferred between the two levels in aligned blocks
of size B, and an algorithm's performance in this model is the number of such memory
transfers. An external-memory algorithm may be parameterized by B and M. The cache-oblivious
model [FLPR99] requires the additional property that the algorithm is not parameterized
by B or M, though the number of memory transfers (and analysis) still depend on these
parameters. One consequence of this property is that an optimal cache-oblivious algorithm
is simultaneously optimal between every pair of levels in every possible memory hierarchy.
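
As a rough illustration of how cost is measured in this model, the following Python sketch
counts the distinct aligned blocks touched when a sequence of array positions is read. The
function name `count_block_transfers` is hypothetical, and the sketch assumes that fast memory
is large enough to keep every block of the path resident, so each block is charged once.

```python
from typing import Iterable


def count_block_transfers(positions: Iterable[int], block_size: int) -> int:
    """Count distinct aligned blocks of size block_size touched by positions.

    Assumption (not from the paper): every block read along the path stays
    in fast memory, so a block is charged only the first time it is touched.
    """
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    # Position i lives in aligned block number i // block_size.
    return len({pos // block_size for pos in positions})
```

For example, a path stored contiguously at positions 0 through 7 with block size 4 touches
two blocks, whereas a path scattered over positions 0, 9, 18, and 27 touches four.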

The general objective in a tree-layout problem is to store the N nodes of a static fixed- 
topology tree in a linear array so as to minimize the number of memory transfers incurred 
by visiting the nodes in order along a path starting at the root of the tree. Each node 
must be stored exactly once. The specific goal in a tree-layout problem varies depending 
on the relative importance of the memory-transfer cost of different root-to-node paths. (It 
is impossible to minimize the number of memory transfers along every root-to-node path 
simultaneously.)

Tree-layout problems have been considered before. Clark and Munro [CM96] give a linear-time
algorithm to find an external-memory tree layout with the minimum worst-case number
of memory transfers along all root-to-leaf paths. Gil and Itai [GI99] give a polynomial-time
algorithm to find an external-memory tree layout with the minimum expected number of
memory transfers along a randomly selected root-to-leaf path, given a fixed independent
probability distribution on the leaves. Alstrup et al. [ABD+04] give a general transformation
from external-memory tree layouts to cache-oblivious tree layouts. In particular, they obtain
polynomial-time algorithms to find a cache-oblivious layout with minimum worst-case or
expected number of memory transfers along a root-to-leaf path, up to constant factors.

We consider the natural parameterization of the tree-layout problem by the length D of 
the root-to-node path, i.e., the maximum depth D of the accessed nodes. Without such a 
parameterization, the best worst-case bound that can be stated over all trees is $\Theta(\lceil N/B \rceil)$
memory transfers, because the tree might be a path. Parameterized by D, the worst-case 
cost can be substantially better, depending on the relationship between N, B, and D. We 
characterize the worst-case number of memory transfers incurred by a root-to-node path in 
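a binary tree, over all possible values of these parameters, as

\[
\begin{cases}
\Theta\left(\dfrac{D}{\lg(B+1)}\right) & \text{when } D = O(\lg N), \\[1ex]
\Theta\left(\dfrac{\lg N}{\lg\left(2 + \frac{B \lg N}{D}\right)}\right) & \text{when } D = \Omega(\lg N) \text{ and } D = O(B \lg N), \\[1ex]
\Theta\left(\dfrac{D}{B}\right) & \text{when } D = \Omega(B \lg N).
\end{cases}
\]

This characterization consists of an external-memory and a cache-oblivious layout algorithm,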
and a matching worst-case lower bound. The external-memory layout algorithm runs in 
$O(N)$ time, and the cache-oblivious layout algorithm runs in $O(N \lg N)$ time. These construction
times are measured as CPU time on a RAM; the same upper bounds also hold on
the number of memory transfers. As in previous work, we do not know how to guarantee a 
substantially smaller number of memory transfers during construction, because on input the 
tree might be scattered throughout memory. 
2 Upper Bound



Our layout algorithm consists of two phases. The first phase is simple and achieves the
desired bound for $D = O(\lg N)$ without significantly raising the cost for larger $D$. The
second phase is more complicated, particularly in the analysis, and achieves the desired
bound for $D = \Omega(\lg N)$. Both phases run in $O(N)$ time.

2.1 Phase 1 

The first part of our layout simply stores the first $\Theta(\lg N)$ levels according to a B-tree
clustering, as if those levels contained a perfect binary tree. More precisely, the first block
in the layout consists of the $\le B$ nodes in the topmost $\lfloor \lg(B + 1) \rfloor$ levels of the binary tree.
Conceptually removing these nodes from the tree leaves $O(B)$ disjoint trees which we lay out
recursively, stopping once the topmost $c \lg N$ levels have been laid out, for any fixed $c > 0$.

This phase defines a layout for a subtree of the tree, which we call the phase-1 tree. The
remaining nodes form a forest of nodes to be laid out in the second phase. We call each 
connected tree of this forest a phase-2 tree. 

The number of memory blocks along any root-to-node path within the phase-1 tree, i.e.,
of length $D < c \lg N$, is $\Theta(D / \lg(B + 1))$. More generally, any root-to-node path incurs a
cost of $\Theta(\min\{D, \lg N\} / \lg(B + 1))$ within the phase-1 tree, i.e., for the first $c \lg N$ nodes.

2.2 Phase 2: Layout Algorithm 

The second phase defines a layout for each phase-2 tree, i.e., for each connected tree of nodes 
not laid out during the first phase. 

For a node $x$ in the tree, let $w(x)$ be the weight of $x$, i.e., the number of nodes in
the subtree rooted at node $x$. Let $\ell(x)$ and $r(x)$ be the left and right children of node $x$,
respectively. If $x$ lacks a child, $\ell(x)$ or $r(x)$ is a null node whose weight is defined to be 0.

For a simpler recursion, we consider a generalized form of the layout problem where the 
goal is to lay out the subtree rooted at a node x into blocks such that the block containing 
the root of the tree is constrained to have at most $A$ nodes, for some nonnegative integer
$A \le B$, while all other blocks can have up to $B$ nodes. This restriction represents the
situation when $B - A$ nodes have already been placed in the root block (in the caller to the
recursion), so space for only $A$ nodes remains.

Our algorithm chooses a set $K(x, A)$ of nodes to store into the root block by placing the
root $x$ and dividing the remaining $A - 1$ nodes of space between the two child subtrees
of $x$ in proportion to their weights. More precisely, $K(x, A)$ is defined recursively as
follows:

\[
K(x, A) =
\begin{cases}
\emptyset & \text{if } A < 1, \\[1ex]
\{x\} \cup K\left(\ell(x),\ (A - 1) \cdot \dfrac{w(\ell(x))}{w(x)}\right)
      \cup K\left(r(x),\ (A - 1) \cdot \dfrac{w(r(x))}{w(x)}\right) & \text{otherwise.}
\end{cases}
\]

Because $w(x) = 1 + w(\ell(x)) + w(r(x))$, $|K(x, A)| \le A$. Also, for positive $A$, $K(x, A)$ always
includes the root node $x$ itself.

At the top level of recursion, the algorithm creates a root block $K(r, B)$, where $r$ is the
root of the phase-2 tree $T$, as the first block in the layout of that tree $T$. Then the algorithm
recursively lays out the trees in the forest $T - K(r, B)$, starting with root blocks of $K(r', B)$
for each child $r'$ of a node in $K(r, B)$ that is not in $K(r, B)$.
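
The recursive definition of K(x, A) and the block-by-block recursion can be sketched in
Python as follows. This is a minimal illustration rather than the authors' implementation:
the names `Node`, `compute_weights`, `root_block`, and `layout_phase2` are hypothetical, exact
rational arithmetic (`fractions.Fraction`) is an assumption made to avoid rounding issues in
the proportional split, and the plain recursion is only suitable for trees shallower than
Python's recursion limit.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import List, Optional


@dataclass
class Node:
    label: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    weight: int = 0  # w(x), filled in by compute_weights


def compute_weights(x: Optional[Node]) -> int:
    """Set w(x) = 1 + w(l(x)) + w(r(x)); a null node has weight 0."""
    if x is None:
        return 0
    x.weight = 1 + compute_weights(x.left) + compute_weights(x.right)
    return x.weight


def root_block(x: Optional[Node], space: Fraction,
               block: List[Node], pending: List[Node]) -> None:
    """Append K(x, space) to block; roots of excluded subtrees go to pending."""
    if x is None:
        return
    if space < 1:
        pending.append(x)  # no room left: x becomes the root of a later block
        return
    block.append(x)
    for child in (x.left, x.right):
        if child is not None:
            # Split the remaining space - 1 in proportion to subtree weight.
            share = (space - 1) * Fraction(child.weight, x.weight)
            root_block(child, share, block, pending)


def layout_phase2(root: Node, B: int) -> List[List[str]]:
    """Return the blocks (as lists of labels), starting with the root block."""
    if B < 1:
        raise ValueError("B must be at least 1")
    compute_weights(root)
    blocks, stack = [], [root]
    while stack:
        r = stack.pop()
        block: List[Node] = []
        pending: List[Node] = []
        root_block(r, Fraction(B), block, pending)  # each new block gets B space
        blocks.append([n.label for n in block])
        stack.extend(reversed(pending))
    return blocks
```

On a five-node path a, b, c, d, e with B = 4, this sketch places a, b, c in the root block
(the allotted space shrinks from 4 to 12/5 to 21/20) and d, e in a second block. Blocks may
hold fewer than B nodes because fractional space is never rounded up.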



2.3 Phase 2: Analysis 

Within this analysis, let $D$ denote the depth of the path within this phase-2 tree $T$ ($O(\lg N)$
less than the global notion of $D$). Define the density $p(x)$ of a node $x$ to be $w(x)/w(r)$ where
$r$ is the root of the phase-2 tree $T$. In other words, the density of $x$ measures the fraction
of the entire tree within the subtree rooted at the node $x$. Let $T_x$ denote the subtree rooted
at $x$.

Consider a (downward) root-to-node path $x_0, x_1, \ldots, x_k$ where $x_0$ is the root of the tree.
Define $p_i = p(x_i)$ for $0 \le i \le k$, and define $q_i = p_i / p_{i-1}$ for $1 \le i \le k$. Thus
$p_i = p_0 q_1 q_2 \cdots q_i = q_1 q_2 \cdots q_i$ because $p_0 = 1$. If $x_k$ is in the block containing
the root $x_0$, then the number $m_k$ of nodes from $T_{x_k}$ that the algorithm places in that block
is given by the recurrence

\[
m_0 = B, \qquad m_k = (m_{k-1} - 1)\, q_k,
\]

which solves to

\[
\begin{aligned}
m_k &= \bigl(\cdots\bigl(\bigl((B - 1) q_1 - 1\bigr) q_2 - 1\bigr) q_3 - \cdots - 1\bigr) q_k \\
    &= (B q_1 \cdots q_k) - (q_1 q_2 \cdots q_k) - (q_2 q_3 \cdots q_k) - \cdots - (q_{k-1} q_k) - q_k \\
    &= B p_k - p_k - \frac{p_k}{p_1} - \cdots - \frac{p_k}{p_{k-2}} - \frac{p_k}{p_{k-1}}.
\end{aligned}
\]

This number is at least 1 precisely when there is room for $x_k$ in the block containing the
root $x_0$. Thus, if $x_k$ is not in the block containing the root $x_0$, then we must have the
opposite:

\[
B p_k - p_k - \frac{p_k}{p_1} - \cdots - \frac{p_k}{p_{k-2}} - \frac{p_k}{p_{k-1}} < 1,
\]

i.e.,

\[
p_k + \frac{p_k}{p_1} + \cdots + \frac{p_k}{p_{k-2}} + \frac{p_k}{p_{k-1}} > B p_k - 1.
\]

Because $p_0 > p_1 > \cdots > p_k$, each term $p_k / p_i$ on the left-hand side is at most 1, so the
left-hand side is at most $k$. Therefore $k > B p_k - 1$.
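
The closed form for m_k can be checked numerically against the recurrence. The sketch below
is only a sanity check under stated assumptions: the helper names `nodes_by_recurrence` and
`nodes_by_closed_form` are hypothetical, and a path is described simply by the list of subtree
weights w(x_0) > w(x_1) > ... along it.

```python
from fractions import Fraction
from typing import List


def nodes_by_recurrence(B: int, weights: List[int]) -> Fraction:
    """Iterate m_0 = B, m_i = (m_{i-1} - 1) * q_i with q_i = w_i / w_{i-1}."""
    m = Fraction(B)
    for prev, cur in zip(weights, weights[1:]):
        m = (m - 1) * Fraction(cur, prev)
    return m


def nodes_by_closed_form(B: int, weights: List[int]) -> Fraction:
    """Evaluate B*p_k - p_k/p_0 - p_k/p_1 - ... - p_k/p_{k-1}, where p_0 = 1."""
    k = len(weights) - 1
    p = [Fraction(w, weights[0]) for w in weights]  # densities p_i = w_i / w_0
    return B * p[k] - sum(p[k] / p[i] for i in range(k))
```

With B = 8 and weights 100, 60, 30, 10, both functions give m_3 = 1/5, which is below 1, so
the fourth node on that path starts a new block. Consistent with the inequality above,
k = 3 exceeds B p_3 - 1 = -1/5.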

Let $\operatorname{cost}_B(N, D) = \operatorname{cost}(N, D)$ denote the number of memory blocks of size $B$ visited
along a worst-case root-to-node path of length $D$ in a tree of $N$ nodes laid out according to
our algorithm. Certainly $\operatorname{cost}(N, D)$ is nondecreasing in $N$ and $D$. Suppose the root-to-node
path visits nodes in the order $x_0, x_1, \ldots, x_k, \ldots$, with $x_k$ being the first node outside the block
containing the root node. By the analysis above,

\[
\begin{aligned}
\operatorname{cost}(N, D) &= \operatorname{cost}(N p_k,\ D - k) + 1 \\
                          &\le \operatorname{cost}(N p_k,\ D - p_k B + 1) + 1.
\end{aligned}
\]

This inequality is a recurrence that provides an upper bound on $\operatorname{cost}(N, D)$. The base
cases are $\operatorname{cost}(1, D) = 1$ and $\operatorname{cost}(N, 0) = 1$. In the remainder of this section, we solve this
recurrence.

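Define $x_{k_0}, x_{k_1}, x_{k_2}, \ldots, x_{k_t}$ to be the first node within each memory block visited along
the root-to-node path. Thus $x_{k_j}$ is the root of the subtree that formed the $j$th block, so $x_{k_0}$
is the root of the tree, and $k_1 = k$. As before, define $p_{k_j} = p(x_{k_j})$, now measuring each
density within the subtree rooted at $x_{k_{j-1}}$, as the recurrence does. Now we can expand the
recurrence $t$ times:

\[
\operatorname{cost}(N, D) \le \operatorname{cost}\left(N \prod_{i=1}^{t} p_{k_i},\ D - B \sum_{i=1}^{t} p_{k_i} + t\right) + t.
\]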

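So the $\operatorname{cost}(N, D)$ recursion terminates when

\[
\prod_{i=1}^{t} p_{k_i} \le \frac{1}{N} \quad\text{or}\quad \sum_{i=1}^{t} p_{k_i} \ge \frac{D + t}{B},
\]

whichever comes first. Because $t \le D$, the recursion must terminate once

\[
\prod_{i=1}^{t} p_{k_i} \le \frac{1}{N} \quad\text{or}\quad \sum_{i=1}^{t} p_{k_i} \ge \frac{2D}{B},
\]

whichever comes first.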

Our goal is to find an upper bound on the maximum value of $t$ at which the recursion
could terminate, because $t + 1$ is the number of memory transfers incurred. Define $p$ to be the
average of the $p_{k_i}$'s, $(p_{k_1} + \cdots + p_{k_t})/t$. In the termination condition, the product
$\prod_{i=1}^{t} p_{k_i}$ is at most $\prod_{i=1}^{t} p$ because the product of terms with a fixed sum is
maximized when the terms are equal; and the sum $\sum_{i=1}^{t} p_{k_i}$ is equal to $\sum_{i=1}^{t} p$.
Thus the following termination condition happens later than the original termination condition:

\[
\prod_{i=1}^{t} p \le \frac{1}{N} \quad\text{or}\quad \sum_{i=1}^{t} p \ge \frac{2D}{B}.
\]

Therefore, by obtaining a worst-case upper bound on $t$ with this termination condition, we
also obtain a worst-case upper bound on $t$ with the original termination condition.
Now the $\operatorname{cost}(N, D)$ recursion terminates when

\[
p^t \le \frac{1}{N} \quad\text{or}\quad t p \ge \frac{2D}{B},
\]

i.e., when

\[
t \ge \frac{\lg N}{\lg(1/p)} \quad\text{or}\quad t \ge \frac{2D}{B p}.
\]

Thus we obtain the following upper bound on the number of memory transfers along this
path:

\[
\min\left\{ \frac{\lg N}{\lg(1/p)},\ \frac{2D}{B p} \right\}.
\]

Maximizing this bound with respect to $p$ gives us an upper bound irrespective of $p$. The
maximum value is achieved when either $p = 0$, $p = 1$, or the two terms in the min are equal.
At $p = 0$, the bound is 0, so this is never the maximum. At $p = 1$, the bound is $2D/B$. The
two terms in the min are equal when, by cross-multiplying,

\[
B p \lg N = 2 D \lg(1/p).
\]
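
The density at which the two terms of the min balance has no simple closed form, but it can
be located numerically. The following Python sketch uses bisection; the function names
`balancing_density` and `transfer_bound` are hypothetical, and the sketch assumes N > 1,
B >= 1, and D >= 1 so that the balancing point lies strictly between 0 and 1.

```python
import math


def balancing_density(N: int, B: int, D: int, iterations: int = 200) -> float:
    """Bisection for the p in (0, 1) with lg N / lg(1/p) = 2D / (B p)."""
    lg_n = math.log2(N)

    def gap(p: float) -> float:
        # Cross-multiplied difference B*p*lg N - 2*D*lg(1/p); it increases
        # with p, is negative near 0, and equals B*lg N > 0 at p = 1.
        return B * p * lg_n - 2 * D * math.log2(1 / p)

    lo, hi = 1e-12, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    return hi


def transfer_bound(N: int, B: int, D: int) -> float:
    """Value of min{lg N / lg(1/p), 2D / (B p)} at the balancing density."""
    p = balancing_density(N, B, D)
    return min(math.log2(N) / math.log2(1 / p), 2 * D / (B * p))
```

Because 2D/(Bp) is at least 2D/B for every p at most 1, the value at the balancing density
is never smaller than the value 2D/B obtained at p = 1.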
2.4 Putting It Together



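The total number of memory transfers is the sum over the first and second phases. If
$D < c \lg N$, only the first phase plays a role, and the cost is $O(D / \lg(B + 1))$. If $D > c \lg N$,
the cost is the sum

\[
O\left( \frac{c \lg N}{\lg(B + 1)} + \frac{\lg N}{\lg\left(2 + \frac{B \lg N}{D - c \lg N}\right)} + \frac{D - c \lg N}{B} \right),
\]

which is at most

\[
O\left( \frac{\lg N}{\lg(B + 1)} + \frac{\lg N}{\lg\left(2 + \frac{B \lg N}{D}\right)} + \frac{D}{B} \right).
\]
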

Because $D = \Omega(\lg N)$, the denominator of the second term is at most $\lg(B + 1)$, so the first
term is always at most the second term up to constant factors. Thus we focus on the second
and third terms. If $D = X \lg N$, then the second term is $\Theta((\lg N) / \lg(2 + B/X))$ and the
third term is $\Theta((X \lg N)/B) = \Theta((\lg N)/(B/X))$. For $X = O(B)$, the second term divides
$\lg N$ by $\Theta(\lg(B/X))$, while the third term divides $\lg N$ by $\Theta(B/X)$. Thus the second term
is larger up to constant factors for $X = O(B)$. For $X = \Omega(B)$, the second term is $\Theta(\lg N)$,
while the third term is $\Theta((X/B) \lg N)$, which is larger up to constant factors.

In summary, the first term dominates when $D = O(\lg N)$, the second term dominates
when $D = \Omega(\lg N)$ and $D = O(B \lg N)$, and the third term dominates when $D = \Omega(B \lg N)$.
Therefore we obtain the following overall bound:

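Theorem 1 Given $B$ and a fixed-topology tree on $N$ nodes, we can compute in $O(N)$ time
an external-memory tree layout with block size $B$ in which the number of memory transfers
incurred along a root-to-node path of length $D$ is

\[
\begin{cases}
O\left(\dfrac{D}{\lg(B+1)}\right) & \text{when } D = O(\lg N), \\[1ex]
O\left(\dfrac{\lg N}{\lg\left(2 + \frac{B \lg N}{D}\right)}\right) & \text{when } D = \Omega(\lg N) \text{ and } D = O(B \lg N), \\[1ex]
O\left(\dfrac{D}{B}\right) & \text{when } D = \Omega(B \lg N).
\end{cases}
\]
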
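
To get a feel for the three regimes, the expressions inside the asymptotic bounds can be
evaluated for concrete parameters. The sketch below is illustrative only: the function name
`bound_terms` is hypothetical, and all hidden constants are taken to be 1 (an assumption), so
the regime boundaries are placed exactly at D = lg N and D = B lg N.

```python
import math
from typing import Tuple


def bound_terms(N: int, B: int, D: int) -> Tuple[str, float, float, float]:
    """Return (regime, first, second, third) for the three bound expressions.

    Assumes N > 1, B >= 1, D >= 1, and treats every hidden constant as 1.
    """
    lg_n = math.log2(N)
    first = D / math.log2(B + 1)                 # D / lg(B + 1)
    second = lg_n / math.log2(2 + B * lg_n / D)  # lg N / lg(2 + B lg N / D)
    third = D / B                                # D / B
    if D <= lg_n:
        regime = "first"
    elif D <= B * lg_n:
        regime = "second"
    else:
        regime = "third"
    return regime, first, second, third
```

For N = 2^20 and B = 64, depth D = 200 falls in the middle regime, where the second
expression (about 6.5) exceeds the third (about 3.1). At D = 5000 the third expression
(about 78) exceeds the second (about 17).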
2.5 Cache-Oblivious Layout 

Our external-memory layout is parameterized by B. This layout can be transformed into a 
single cache-oblivious layout that is independent of B. 

We use a general transformation from external-memory tree layouts to cache-oblivious
tree layouts by Alstrup et al. [ABD+04]. This transformation takes as input an arbitrary
external-memory tree-layout algorithm as a black box, and produces a cache-oblivious layout
with approximately equal performance. More precisely, the number of memory transfers
incurred along any root-to-node¹ path in the cache-oblivious layout is at most a constant
factor larger than the external-memory layout constructed with the machine's true value
of $B$. The cache-oblivious layout algorithm applies the external-memory layout black box
for $O(\lg N)$ values of $B$, and combines these layouts into a single linear order of the nodes.

If we apply this general transformation to our external-memory layout algorithm, we 
obtain a cache-oblivious layout algorithm with the desired properties. Because the number 
of memory transfers along every root-to-node path increases by at most a constant factor, our 
worst-case bound also applies to the cache-oblivious layout. Because our external-memory 
layout requires O(N) construction time for a particular value of B, the total construction 
time of the resulting cache-oblivious layout is $O(N \lg N)$.

Theorem 2 Given a fixed-topology tree on $N$ nodes, we can compute in $O(N \lg N)$ time a
cache-oblivious tree layout in which the number of memory transfers incurred along a
root-to-node path of length $D$ satisfies the same bound as Theorem 1.

¹ A minor detail is that Alstrup et al. [ABD+04] assume that all accesses are to leaves instead of arbitrary
nodes. This difference is not essential: adding a leaf child to every nonleaf node and treating these two nodes
as equivalent allows us to assume that all accesses are to leaves.
Figure 1: The recursive lower-bound construction: a complete binary tree with 1/p leaves
attached to 1/p paths of length pB, each attached to a recursive construction.

3 Lower Bound 

For $D < \lg(N + 1)$, the perfectly balanced binary tree on $N$ nodes gives a worst-case lower
bound of $\Omega(D / \lg B)$ memory transfers. For all $D$, any root-to-node path of length $D$ requires
at least $D/B$ memory transfers just to read the $D$ nodes along the path. Thus we are left
with proving a lower bound for the case when $D = \Omega(\lg N)$ and $D = O(B \lg N)$.

This lower-bound construction essentially mimics the worst-case behavior predicted in
Section 2.3. We set $p$ to be the solution to Equation (1), i.e., to $B p \lg N = D \lg(1/p)$. Because
$D = \Omega(\lg N)$, this equation implies that

\[
B p = \Omega(\lg(1/p)). \tag{3}
\]

The asymptotic solution for $1/p$ is given by Equation (2).

Using this value of $p$, we build a tree of slightly more than $B$ nodes, as shown in Figure 1,
that partitions the space of nodes into $1/p$ fractions of $p$. We repeat this tree construction
recursively in each of the children subtrees, stopping at the height that results in $N$ nodes.

Consider any external-memory layout of the tree. Because each tree construction has
more than $B$ nodes, it cannot fit in a block. Thus every tree construction has at least one
leaf that is not in the same block as the root. Hence, for any $k > 1$, there is a root-to-node
path that incurs at least $k$ memory transfers by visiting $k$ tree constructions. Such a path
has length $D = O(k\,[pB + \lg(1/p)])$, which is $O(k p B)$ by Equation (3). Therefore

\[
k = \Omega\left(\frac{D}{p B}\right) = \Omega\left(\frac{\lg N}{\lg(1/p)}\right),
\]

where the second equality uses $B p \lg N = D \lg(1/p)$.
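
The size of one copy of the construction in Figure 1 can be tallied directly from its
description. The sketch below is a small illustration under simplifying assumptions that
are not in the text: 1/p is a power of two, pB is an integer, and the helper name
`gadget_stats` is hypothetical.

```python
from typing import Tuple


def gadget_stats(inv_p: int, B: int) -> Tuple[int, int]:
    """Return (node count, levels descended) for one copy of the construction.

    inv_p stands for 1/p. Assumes inv_p is a power of two and divides B,
    so that the path length p*B is an integer.
    """
    if inv_p < 1 or (inv_p & (inv_p - 1)) != 0:
        raise ValueError("inv_p must be a power of two")
    if B % inv_p != 0:
        raise ValueError("inv_p must divide B")
    path_len = B // inv_p                      # each path has p*B nodes
    tree_nodes = 2 * inv_p - 1                 # complete binary tree, 1/p leaves
    total = tree_nodes + inv_p * path_len      # equals B + 2/p - 1 > B
    descent = (inv_p.bit_length() - 1) + path_len  # lg(1/p) + p*B
    return total, descent
```

For 1/p = 8 and B = 64 this gives 79 nodes, which cannot fit in one block of 64, and a
descent of 3 + 8 = 11 levels per copy, matching the pB + lg(1/p) term in the path length.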
Theorem 3 For any values of $N$, $B$, and $D$, there is a fixed-topology tree on $N$ nodes in
which every external-memory layout with block size $B$ incurs

\[
\begin{cases}
\Omega\left(\dfrac{D}{\lg(B+1)}\right) & \text{when } D = O(\lg N), \\[1ex]
\Omega\left(\dfrac{\lg N}{\lg\left(2 + \frac{B \lg N}{D}\right)}\right) & \text{when } D = \Omega(\lg N) \text{ and } D = O(B \lg N), \\[1ex]
\Omega\left(\dfrac{D}{B}\right) & \text{when } D = \Omega(B \lg N)
\end{cases}
\]

memory transfers along some root-to-node path of length $D$.